First report on the lyrics dataset¶
This first report is an exploration of the lyrics dataset. It covers the input data, data processing (data quality and featurization), data exploration, and next steps.
After preprocessing the data into useful features, I will take an analytical approach to find dependencies or correlations, so that I can build a meaningful model based on these findings in the next report.
Input data¶
The lyrics dataset contains popular songs with their lyrics. There are six columns: id, song name, year of publishing, interpreter name, assigned genre and the lyrics. Some rows are musical compositions without lyrics. The names of songs and interpreters use hyphens instead of spaces. Apart from the id, only one of these columns is numeric: the year. The other columns are strings and will be used as categorical variables.
In this section, we will show what the data looks like and try to understand its basics before further processing.
This is what the dataframe looks like:
| | song_name | year | interpreter | genre | lyrics |
|---|---|---|---|---|---|
| 0 | ego-remix | 2009 | beyonce-knowles | Pop | Oh baby, how you doing? You know I'm gonna cut right to the chase Some women were made but me, m... |
| 1 | then-tell-me | 2009 | beyonce-knowles | Pop | playin' everything so easy, it's like you seem so sure. still your ways, you dont see i'm not su... |
| 2 | honesty | 2009 | beyonce-knowles | Pop | If you search For tenderness It isn't hard to find You can have the love You need to live But if... |
| 3 | you-are-my-rock | 2009 | beyonce-knowles | Pop | Oh oh oh I, oh oh oh I [Verse 1:] If I wrote a book about where we stand Then the title of my bo... |
| 4 | black-culture | 2009 | beyonce-knowles | Pop | Party the people, the people the party it's popping no sitting around, I see you looking you loo... |
| ... | ... | ... | ... | ... | ... |
| 362232 | who-am-i-drinking-tonight | 2012 | edens-edge | Country | I gotta say Boy, after only just a couple of dates You're hands down, outright blowing my mind I... |
| 362233 | liar | 2012 | edens-edge | Country | I helped you find her diamond ring You made me try it on and everything Tomorrow you'll both say... |
| 362234 | last-supper | 2012 | edens-edge | Country | Look at the couple in the corner booth Looks a lot like me and you She's looking out at the wind... |
| 362235 | christ-alone-live-in-studio | 2012 | edens-edge | Country | When I fly off this mortal earth And I'm measured up by depth and girth The Father says now what... |
| 362236 | amen | 2012 | edens-edge | Country | I heard from a friend of a friend of a friend that You finally got rid of that girlfriend You fi... |
362237 rows × 5 columns
The dataset contains 362237 songs, but only 266557 of them have any lyrics. This is a fairly large amount of data (it is one of the largest datasets), especially after featurization of the text. Therefore, I will do the basic analysis on all samples, but some analyses and the modelling will use only the songs that have lyrics.
Next, you can see a few charts showing the distribution of values in individual columns. For better readability, I have replaced "-" in song and interpreter names with spaces and capitalized the first letter of each word.
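A minimal sketch of this renaming step, assuming plain Python string methods are enough. The helper name `prettify_name` is hypothetical and not taken from the original notebook:

```python
def prettify_name(raw: str) -> str:
    """Turn a hyphenated slug such as 'beyonce-knowles' into 'Beyonce Knowles'."""
    # Replace hyphens with spaces, then capitalize the first letter of each word.
    return raw.replace("-", " ").title()


print(prettify_name("beyonce-knowles"))
print(prettify_name("it-s-over-now-remix"))
```

With a pandas dataframe, the same function could be applied to a column with `df["song_name"].map(prettify_name)`.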
We can see that most songs were published in 2006 or 2007, so we should be aware that the results mostly reflect the specific culture of that time.
The table also suggests that there might be some data quality issues with the release year, since there are 10 songs with the year 2038 and also a few from the distant past.
The dataset is quite imbalanced in its genre representation, so the results cannot be easily generalized to all music. We also have 29814 songs where the genre is not available and 23683 songs where the genre is labeled as "Other".
Number of unique interpreters: 18231
| | No. of songs | No. of interpreters |
|---|---|---|
| 0 | 1 | 4514 |
| 1 | 2 | 1388 |
| 2 | 3 | 832 |
| 3 | 10 | 763 |
| 4 | 11 | 719 |
The representation of interpreters is quite good: there are 18231 unique names. A few interpreters have a lot of songs, but 4514 have only one song and 1388 have only two.
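The table above can be derived by counting songs per interpreter and then counting how often each count occurs. This is a sketch on toy data, assuming the dataframe is called `df` and has an `interpreter` column as in the tables shown earlier:

```python
import pandas as pd

# Toy stand-in for the real dataframe; only the `interpreter` column matters here.
df = pd.DataFrame({"interpreter": ["A", "A", "A", "B", "C", "C", "D"]})

# Number of songs for each interpreter.
songs_per_interpreter = df["interpreter"].value_counts()

# Number of interpreters that have exactly k songs, indexed by k.
distribution = songs_per_interpreter.value_counts().sort_index()

n_unique = df["interpreter"].nunique()
print(f"Number of unique interpreters: {n_unique}")
print(distribution)
```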
Data processing¶
We have already made the first processing step of replacing hyphens with spaces and capitalizing first letters. Now we need to take care of data quality issues and then featurize the song lyrics.
1. Data quality¶
We have already discovered a few data quality issues. Some of the most frequent data quality issues are null values, duplicates, inconsistencies or outliers. I have found no duplicate values in the dataset. So first, we will have a look at columns with null values.
| column | null values |
|---|---|
| song_name | 2 |
| year | 0 |
| interpreter | 0 |
| genre | 0 |
| lyrics | 95680 |

| | song_name | year | interpreter | genre | lyrics |
|---|---|---|---|---|---|
| 193957 | NaN | 2009 | Booker T And The Mg S | Jazz | All right people, the rest of the hard working All star blues brothers are gonna be out here in ... |
| 325992 | NaN | 2009 | Booker T | Jazz | NaN |
We can see that the null values are mostly rows with missing lyrics. These rows might not be very useful for further analyses or modelling, but since we want to analyze word counts first, we will just replace the missing lyrics with an empty string. There are also two rows with a missing song name, shown in the table above. Those seem to be mistakes, so we can drop them.
The same table shows that interpreters sometimes collaborate with others, so we need to keep in mind that an interpreter's name might also appear inside other interpreter names.
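The handling of missing values described above could look roughly like this. It is a sketch on toy data, assuming a pandas dataframe `df` with `song_name` and `lyrics` columns:

```python
import pandas as pd

# Toy stand-in for the real dataframe.
df = pd.DataFrame({
    "song_name": ["Honesty", None, "Liar"],
    "lyrics": ["If you search", "All right people", None],
})

# Per-column count of missing values.
null_counts = df.isna().sum()

# Keep songs without lyrics for the word count analysis, using empty lyrics.
df["lyrics"] = df["lyrics"].fillna("")

# Drop the rows that have no song name.
df = df.dropna(subset=["song_name"]).reset_index(drop=True)
```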
Next, we will have a look at rows with suspicious release years.
| | song_name | year | interpreter | genre | lyrics |
|---|---|---|---|---|---|
| 27657 | Star | 702 | Clipse | Hip-Hop | You're my star It's such a wonder how you shine So no matter how far I'm dancing with you in my ... |
| 69708 | Anywhere Remix | 112 | Dru Hill | Hip-Hop | Here we are all alone You and me, privacy And we can do anything Your fantasy I wanna make your ... |
| 112159 | Atchim | 2038 | Anita | Rock | |
| 112160 | O Areias | 2038 | Anita | Rock | |
| 112161 | Era Uma Vez Um Cavalo | 2038 | Anita | Rock | |
| 112162 | Anita | 2038 | Anita | Rock | |
| 112163 | Todos Os Patinhos | 2038 | Anita | Rock | |
| 112164 | Joana Come A Papa | 2038 | Anita | Rock | |
| 112165 | Atirei O Pau Ao Gato | 2038 | Anita | Rock | |
| 112166 | Eu Vi Um Sapo | 2038 | Anita | Rock | |
| 112167 | Pipi Das Meias Altas | 2038 | Anita | Rock | |
| 112168 | Minhoca | 2038 | Anita | Rock | |
| 147914 | It S Over Now Remix | 112 | G Dep | Hip-Hop | What is this? Numbers in your pocket I remember when you Used to throw those things away Why do ... |
| 238541 | Come See Me Remix | 112 | Black Rob | Hip-Hop | Baby, you can come see me 'cause I need you here with me, and I'll show you what love is made of... |
| 315540 | Let S Lurk | 67 | Giggs | Hip-Hop | Verse 1: Still pulling up on smoke Skeng in my pocket Can't you see that bulge in my coat Like h... |
| 335205 | I Can T Believe | 112 | Faith Evans | Pop | [Chorus] I can't believe that love has gone away from me I can't believe that love has gone away... |
These rows also seem to be errors. Since we have enough samples, we can exclude them from the dataset as well (most of them would be dropped along with the missing lyrics anyway).
The table of suspicious release years reveals another data quality issue: apostrophes in contractions like "let's" or "can't" were replaced in the same way as spaces, so they show up as "Let S" and "Can T". However, this should not be much of a problem since it is consistent across interpreter and song names, so we will leave it as is.
Looking at the data also reveals that not all songs are in English. For example, the following interpreter sings in German. We will need to take this into consideration when modelling, since a single model that determines the interpreter or genre from lyrics would be greatly influenced by the language of the song. However, we will not adjust the data for this yet.
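One way to exclude rows with implausible release years is a simple range filter. The bounds `MIN_YEAR` and `MAX_YEAR` below are my assumptions for illustration; the report does not state which thresholds were actually used:

```python
import pandas as pd

# Assumed bounds for a plausible release year (hypothetical, not from the report).
MIN_YEAR, MAX_YEAR = 1900, 2025

# Toy stand-in for the real dataframe.
df = pd.DataFrame({
    "song_name": ["Star", "Atchim", "Honesty"],
    "year": [702, 2038, 2009],
})

plausible = df["year"].between(MIN_YEAR, MAX_YEAR)  # inclusive on both ends
suspicious = df[~plausible]                         # rows to inspect
df = df[plausible].reset_index(drop=True)           # rows to keep
```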
| | song_name | year | interpreter | genre | lyrics |
|---|---|---|---|---|---|
| 385 | Wer Liebe Sucht | 2006 | Daliah Lavi | Not Available | Ist das so schwer ein kleines Lächeln wenn du fühlst ein Mann gefällt dir sehr? Dann ist am Aben... |
| 386 | Es Geht Auch So | 2006 | Daliah Lavi | Not Available | Der Weg den du und ich gegangen führt mit einem Mal in's graue Niemandsland wann hat es angefang... |
| 387 | Liebeslied Jener Sommernacht | 2006 | Daliah Lavi | Not Available | Rote Schatten warf das Feuer hell wie Gold war der Tokayer als ein Fremder plötzlich vor mir sta... |
| 388 | Willst Du Mit Mir Gehn | 2006 | Daliah Lavi | Not Available | Willst Du mit mir gehn,Wenn mein Weg in Dunkel führt. Willst Du mit mir gehn, Wenn mein Tag scho... |
| 389 | Meine Art Liebe Zu Zeigen | 2006 | Daliah Lavi | Not Available | Meine Art Liebe zu zeigen das ist ganz einfach Schweigen. Worte zerstören wo sie nicht hingehöre... |
There are also inconsistencies in the song lyrics and the style in which they are written. Some contain spelling mistakes, but there is not much we can do about that. We could see in the tables above that some lyrics contain markers like "Verse 1:" or "[Chorus]". These would be useful if they were present everywhere, but since they are not, it is better to remove them. We will remove all parts in [] brackets and add "Chorus" and "Verse" to the stopwords used when building a model.
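Removing the bracketed markers can be done with a regular expression. The following is a sketch; the names `strip_bracketed` and `EXTRA_STOPWORDS` are hypothetical:

```python
import re

# Matches anything between square brackets, e.g. "[Chorus]" or "[Verse 1:]".
BRACKETED = re.compile(r"\[[^\]]*\]")

# Extra stopwords to merge into the stopword list at modelling time.
EXTRA_STOPWORDS = {"chorus", "verse"}


def strip_bracketed(lyrics: str) -> str:
    """Remove bracketed section markers and collapse the leftover whitespace."""
    without_markers = BRACKETED.sub(" ", lyrics)
    return re.sub(r"\s+", " ", without_markers).strip()


print(strip_bracketed("Oh oh oh I [Verse 1:] If I wrote a book"))
```

Markers written without brackets, such as a bare "Verse 1:", are not touched by this pattern. Those are left to the stopword list.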
2. Featurization¶
Since words cannot be used for modelling as they are, we need to create some features to be able to analyze them further. We will come back to featurization in the modelling stage, since right now I cannot be sure which features we will need. For exploration, I will create the following features:
- Word count: the total number of words in the lyrics
- Unique word count: the number of unique words in the lyrics
- Average word length: the average length of the words in the lyrics
- Word frequencies (TF-IDF featurization will be used in modelling; however, it is less suitable for exploration purposes)

| | word_count | unique_word_count | average_word_length | word_count_ratio |
|---|---|---|---|---|
| mean | 166.495601 | 79.431849 | 4.158632 | 2.140290 |
| std | 167.264722 | 76.453418 | 1.062458 | 1.064717 |
| min | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| 25% | 0.000000 | 0.000000 | 3.812500 | 1.597403 |
| 50% | 146.000000 | 77.000000 | 4.017065 | 1.949367 |
| 75% | 240.000000 | 111.000000 | 4.265896 | 2.444444 |
| max | 7914.000000 | 2746.000000 | 58.000000 | 87.000000 |
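The features summarized above could be computed along these lines. This is a sketch under two assumptions that the report does not confirm: words are obtained by lowercasing and splitting on whitespace, and `word_count_ratio` is the word count divided by the unique word count. The function name `lyric_features` is hypothetical:

```python
def lyric_features(lyrics: str) -> dict:
    """Compute simple length-based features for one song's lyrics."""
    # Assumption: whitespace tokenization, so punctuation stays attached to words.
    words = lyrics.lower().split()
    word_count = len(words)
    unique_word_count = len(set(words))
    nan = float("nan")
    return {
        "word_count": word_count,
        "unique_word_count": unique_word_count,
        # Undefined for songs without lyrics.
        "average_word_length": sum(map(len, words)) / word_count if word_count else nan,
        # Assumed definition: how often each distinct word is repeated on average.
        "word_count_ratio": word_count / unique_word_count if unique_word_count else nan,
    }


print(lyric_features("oh oh oh baby"))
```

Applied to a dataframe, `df["lyrics"].apply(lyric_features).apply(pd.Series)` would give one column per feature, which `describe()` can then summarize.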
The following boxplot shows the distribution of word count and unique word count among genres. Some outliers have been hidden since they would make the chart difficult to read.
This boxplot contains a lot of useful information. For example, we can see that some genres are very rarely without lyrics, while others very often have no lyrics. Above all, songs labeled as "Other" rarely have any lyrics. In Jazz and Electronic, at least half of the songs have no lyrics. Hip-Hop also has many songs with no lyrics; however, on average, it has the highest number of words as well as unique words. Although there are many words in Pop, not many of them are unique.
"Not Available" will probably not be very useful, since it just seems to be an average case.
Word length does not differ very much between genres. Generally, Metal has the longest words and also the largest variance. One explanation that comes to my mind is that Metal lyrics might often be in a different language, but we can come back to this hypothesis in the modelling stage. The average word length is mostly around 4 characters.
This is the frequency of individual words in the lyrics after some basic stopwords were removed. A smaller stopword list was used, with some modifications, since in songs there can be quite a lot of meaning in otherwise meaningless words.
We can see that term frequency does not differ very much between genres. The chart also contains some Spanish words like "que", "de" and "la".
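Word frequencies of this kind can be counted with the standard library. The stopword list and the tokenization pattern below are assumptions for illustration, not the ones used for the chart:

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list.
STOPWORDS = {"the", "a", "and", "to", "of", "chorus", "verse"}

# Runs of letters (including accented ones), optionally joined by apostrophes.
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def top_words(lyrics_list, n=10):
    """Return the n most frequent non-stopword tokens across all given lyrics."""
    counts = Counter()
    for lyrics in lyrics_list:
        tokens = TOKEN.findall(lyrics.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


print(top_words(["I love you, love me", "The love of the party"], n=1))
```

To compare genres, the function can be called once per genre on the lyrics of that genre.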
Data exploration¶
Exploration tips:
- genres and interpreters with most songs
- multigenre interpreters
- distribution of song features (word count, number of unique words etc.)
- song covers and remakes (same name and lyrics, possibly with small differences)
- typical words or patterns for various genres and for various interpreters
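As an illustration of one of these tips, multigenre interpreters can be found by counting distinct genres per interpreter. This is a sketch on toy data, assuming the `interpreter` and `genre` columns shown earlier:

```python
import pandas as pd

# Toy stand-in for the real dataframe.
df = pd.DataFrame({
    "interpreter": ["A", "A", "B", "B", "C"],
    "genre": ["Pop", "Rock", "Jazz", "Jazz", "Pop"],
})

# Number of distinct genres assigned to each interpreter.
genres_per_interpreter = df.groupby("interpreter")["genre"].nunique()

# Interpreters whose songs span more than one genre.
multigenre = genres_per_interpreter[genres_per_interpreter > 1].index.tolist()
print(multigenre)
```

On the real data, "Not Available" and "Other" would probably need to be excluded first, otherwise they count as genres of their own.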
Next steps¶
Based on the results, in my next report, I would like to dedicate space to the following topics:
- creating a representative sample to be able to process data faster
- separating songs based on their language (either by using a library or by unsupervised clustering): this could significantly improve performance of any other model
- prediction of song genre/interpreter/year according to lyrics
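For the first point, one possible reading of "representative" is a sample stratified by genre, so that each genre keeps its share of the data. This interpretation and the sampling fraction are my assumptions; the sketch uses pandas' group-wise sampling on toy data:

```python
import pandas as pd

# Toy stand-in for the real dataframe: 6 Pop songs and 4 Rock songs.
df = pd.DataFrame({
    "genre": ["Pop"] * 6 + ["Rock"] * 4,
    "lyrics": [f"song {i}" for i in range(10)],
})

# Draw the same fraction from every genre; random_state makes it reproducible.
sample = df.groupby("genre").sample(frac=0.5, random_state=0)
print(sample["genre"].value_counts())
```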